RDMA transfers in MapReduce frameworks

ABSTRACT

Embodiments of the present invention provide methods, systems, and computer program products for transferring data in a MapReduce framework. In one embodiment, MapReduce jobs are performed such that data spills are stored by mapper systems in memory and are transferred to reducer systems via one-sided RDMA transfers, which can reduce CPU overhead of mapper systems and the latency of data transfer to reducer systems.

BACKGROUND OF THE INVENTION

The present invention relates generally to the field of MapReduce frameworks, and more particularly to management of data spills in MapReduce frameworks.

MapReduce frameworks provide the ability to process large data sets in a distributed fashion using a cluster of multiple computing nodes. In a typical MapReduce framework implementation, a plurality of mappers are each assigned a portion of data (i.e., a split) from the data set on which to perform one or more tasks (e.g., executing a map script to count occurrences of each word in a string). The output results of each mapper are sorted (e.g., shuffling the output results such that results pertaining to the same words are grouped together) and assigned to reducers, which in turn perform one or more reduce tasks (e.g., executing a reduce script to sum all occurrence values for each word). Accordingly, the MapReduce framework not only allows large data sets to be split between many mappers and reducers, but such mappers and reducers can each perform their respective tasks simultaneously, which can greatly improve the speed and efficiency with which processing jobs can be completed.
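By way of non-limiting illustration, the word-count example above can be sketched in a single process as follows. The function names (map_words, shuffle, reduce_counts) and the sample splits are hypothetical and are introduced only for this sketch; an actual framework would distribute these steps across computing nodes.

```python
from collections import defaultdict


def map_words(split):
    """Map task: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    """Group emitted counts by word so results for the same word sit together."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce task: sum all occurrence values for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical splits of a larger data set, one per mapper.
splits = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for split in splits for pair in map_words(split)]
totals = reduce_counts(shuffle(pairs))
```

In this sketch each call to map_words depends only on its own split, which is the property that allows mappers to run simultaneously.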

Typically, each mapper writes its output results to a memory buffer of finite size (e.g., 100 MB). When the buffer is full, contents of the buffer are spilled to a local disk in a spill file, after which additional output results can be written to the buffer. After a mapper has written its last output result, the spill files are merged and sorted into a single output file, which can be transmitted to an assigned reducer via TCP/IP.

SUMMARY

Embodiments of the present invention provide methods, systems, and computer program products for transferring data in a MapReduce framework. In one embodiment, one or more computer processors receive a data split assigned to a mapper system. A first fixed-address memory region is registered for the mapper system, and one or more mapper tasks are executed on the data split to generate output results. Generated output results are spilled to the first fixed-address memory region, and generated output results are transferred from the first fixed-address memory region to a reducer system using remote direct memory access (RDMA).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a functional block diagram of a MapReduce system, in accordance with an embodiment of the present invention;

FIG. 2 is a flowchart illustrating operations of a mapper system in a MapReduce framework, in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart illustrating operations of a reducer system in a MapReduce framework, in accordance with an embodiment of the present invention; and

FIG. 4 is a block diagram of internal and external components of the computer systems of FIG. 1, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

Embodiments of the present invention recognize that in a typical MapReduce framework implementation, mapper systems store data spills to disk and transfer data spills to reducers via TCP/IP and HTTP, which can result in decreased transfer speeds and performance. Embodiments of the present invention provide methods, systems, and computer program products for transferring data in a MapReduce framework in which data spills are stored by mapper systems in memory and are transferred to reducer systems via one-sided RDMA transfers, which can reduce CPU overhead of mapper systems and the latency of data transfer to reducer systems, and can improve performance of MapReduce jobs.

FIG. 1 is a functional block diagram of MapReduce system 100, in accordance with an embodiment of the present invention. MapReduce system 100 includes job management computer system 102, task management computer system 106, and task management computer system 107, all interconnected over network 104. Job management computer system 102, task management computer system 106, and task management computer system 107 can be desktop computers, laptop computers, specialized computer servers, or any other computer systems known in the art. In certain embodiments, job management computer system 102, task management computer system 106, and task management computer system 107 represent computer systems utilizing clustered computers and components to act as a single pool of seamless resources when accessed through network 104. For example, such embodiments may be used in data center and cloud computing applications. In certain embodiments, job management computer system 102, task management computer system 106, and task management computer system 107 represent virtual machines. In general, job management computer system 102, task management computer system 106, and task management computer system 107 are representative of any electronic devices, or combination of electronic devices, capable of executing machine-readable program instructions, as described in greater detail with regard to FIG. 4.

Job management computer system 102 receives processing jobs from one or more applications and distributes tasks for those processing jobs to task management computer system 106 and task management computer system 107. For illustrative purposes, MapReduce system 100 is depicted in FIG. 1 as having a single job management computer system 102 and two task management computer systems 106 and 107; however, it should be understood that MapReduce system 100 can comprise a cluster of any number of computing nodes that can serve as job management computer systems (e.g., JobTrackers) and any number of computing nodes that can serve as task management computer systems (e.g., TaskTrackers).

Task management computer system 106 includes mapper systems 108 a-n. In this embodiment, mapper systems 108 a-n each represent a Java Virtual Machine (JVM), and task management computer system 106 can instantiate one such JVM for each assigned task. Mapper systems 108 a-n can be hosted locally on task management computer system 106 and/or can be remotely hosted on one or more other computer systems accessible via network 104. In other embodiments, other types of virtual machines and/or hardware systems can be used to implement mapper systems 108 a-n. For illustrative purposes, embodiments of the present invention may hereafter be discussed with respect to mapper system 108 a, it being understood that, unless explicitly stated otherwise, the following discussion also applies to any of mapper systems 108 b-n, depending on which of those mapper systems are assigned to one or more tasks.

Mapper systems 108 a-n each include mapper program 110, data splits 112, primary memory buffer 114, and remote direct memory access (RDMA) mapper buffer 116. Job management computer system 102 provides assigned mapper systems 108 a-n with respective data splits of the larger data set to be processed in the processing job. Mapper program 110 processes data splits 112 assigned by job management computer system 102 to execute one or more mapper tasks and output results. In this embodiment, data splits 112 are retrieved and stored locally on mapper system 108 a, such as using one or more hard disk drives.

Primary memory buffer 114 is a memory buffer in which mapper program 110 stores output results of executed mapper tasks. When primary memory buffer 114 is full, the output results are written to RDMA mapper buffer 116. Stated differently, mapper program 110 spills output results to RDMA mapper buffer 116, rather than to disk. In another embodiment, primary memory buffer 114 and RDMA mapper buffer 116 can be implemented as a single memory buffer.

RDMA mapper buffer 116 is a fixed-address memory region expressed as a fixed memory address and a specified byte range following the fixed memory address (i.e., a locked memory region that cannot be swapped by the operating system). In this embodiment, RDMA mapper buffer 116 is off-JVM heap (i.e., separate from dynamic memory used by the JVM) and RDMA mapper buffer 116 is sized such that it can store all output results of mapper system 108 a for the assigned tasks. Stated differently, RDMA mapper buffer 116 is sufficiently large such that no spilt data will be written to disk. In this embodiment, multiple RDMA mapper buffers 116 can be created to achieve various configurations. For example, as discussed in greater detail later in this specification, one RDMA mapper buffer 116 can be created for each of reducer systems 118 a-n that are assigned to mapper system 108 a (i.e., dedicated buffers), and/or RDMA mapper buffers 116 can be shared by multiple mapper systems 108 a-n that are assigned to common reducer systems 118 a-n (i.e., shared and reused buffers).
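By way of non-limiting illustration, the relationship between a primary memory buffer and a fixed-size spill region can be sketched as follows. The classes FixedRegion and Mapper, and the byte limits used, are hypothetical names and values chosen for this sketch. A Python bytearray is only a stand-in: it is neither off-heap nor locked against swapping, both of which an actual embodiment would obtain from the operating system and the RDMA hardware.

```python
class FixedRegion:
    """Stand-in for a registered fixed-size memory region (hypothetical)."""

    def __init__(self, capacity):
        self.data = bytearray(capacity)  # allocated once and never resized
        self.used = 0

    def append(self, payload):
        end = self.used + len(payload)
        if end > len(self.data):
            # The region is meant to be sized so that this never happens.
            raise MemoryError("region too small for spilled data")
        self.data[self.used:end] = payload
        self.used = end


class Mapper:
    """Writes records to a small primary buffer and spills to the region."""

    def __init__(self, primary_limit, region):
        self.primary = bytearray()
        self.primary_limit = primary_limit
        self.region = region
        self.spills = 0

    def emit(self, record):
        # Spill first if the record would overflow the primary buffer.
        if len(self.primary) + len(record) > self.primary_limit:
            self.spill()
        self.primary += record

    def spill(self):
        # Spill to memory rather than to disk.
        if self.primary:
            self.region.append(bytes(self.primary))
            self.primary.clear()
            self.spills += 1


region = FixedRegion(capacity=64)
mapper = Mapper(primary_limit=16, region=region)
for record in [b"alpha\n", b"beta\n", b"gamma\n", b"delta\n"]:
    mapper.emit(record)
mapper.spill()  # flush whatever remains after the last record
```

The sketch raises an error instead of falling back to disk when the region is exhausted, which is a design choice of the sketch only.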

Task management computer system 107 includes reducer systems 118 a-n. As previously discussed with regard to mapper systems 108 a-n, in this embodiment, reducer systems 118 a-n each represent a JVM, and task management computer system 107 can instantiate one such JVM for each assigned task. Reducer systems 118 a-n can be hosted locally on task management computer system 107 and/or can be remotely hosted on one or more other computer systems accessible via network 104. In other embodiments, other types of virtual machines and/or hardware systems can be used to implement reducer systems 118 a-n. For illustrative purposes, embodiments of the present invention may hereafter be discussed with respect to reducer system 118 a, it being understood that, unless explicitly stated otherwise, the following discussion also applies to any of reducer systems 118 b-n, depending on which of those reducer systems are assigned to one or more tasks.

Reducer systems 118 a-n each include reducer program 120, primary memory buffer 122, and RDMA reducer buffer 124. Each of reducer systems 118 a-n is assigned to one or more of mapper systems 108 a-n by job management computer system 102. Reducer systems 118 a-n receive output results (e.g., partitions of data) from one or more mapper systems 108 a-n to which they are assigned, and perform one or more reducer tasks on those output results. Again, for illustrative purposes, embodiments of the present invention may hereafter be discussed with respect to reducer system 118 a, it being understood that, unless explicitly stated otherwise, the following discussion also applies to any of reducer systems 118 b-n, depending on which of those reducer systems are assigned.

Reducer program 120 of reducer system 118 a processes output results (e.g., data partitions) generated by mapper systems 108 a-n to which reducer system 118 a is assigned, to merge the output results and execute one or more reducer tasks on the merged output results.

Primary memory buffer 122 is a dynamic memory buffer used by reducer system 118 a to store merged output results generated by mapper systems 108 a-n to which reducer system 118 a is assigned. For example, primary memory buffer 122 can be a JVM heap.

RDMA reducer buffer 124 is a fixed-address memory region expressed as a fixed memory address and a specified byte range following the fixed memory address (i.e., a locked memory region). RDMA reducer buffer 124 is used by reducer system 118 a to receive and store output results generated by mapper systems 108 a-n to which reducer system 118 a is assigned, prior to merging and storing those results in primary memory buffer 122. In this embodiment, RDMA reducer buffer 124 is off-JVM heap (i.e., separate from dynamic memory used by the JVM, such as primary memory buffer 122), and RDMA reducer buffer 124 is sized such that it can store all output files received from assigned mapper systems 108 a-n.

Reducer program 120 performs one-sided RDMA transfers of data from RDMA mapper buffer 116 to RDMA reducer buffer 124. In one embodiment, RDMA transfers are performed using InfiniBand technology over network 104. In another embodiment, RDMA transfers are performed using RDMA over converged Ethernet (RoCE) technology over network 104. In general, any suitable RDMA transfer technology known in the art can be used.

Network 104 can be, for example, a local area network (LAN), a wide area network (WAN) such as the Internet, or a combination of the two, and include wired, wireless, or fiber optic connections. In general, network 104 can use any combination of connections and protocols that will support communications between job management computer system 102, mapper systems 108 a-n, and reducer systems 118 a-n, including RDMA transfers, in accordance with a desired embodiment of the invention.

FIG. 2 is a flowchart illustrating operations of a mapper system, in accordance with an embodiment of the present invention. For illustrative purposes, the following discussion will be made with respect to mapper system 108 a.

Mapper program 110 receives a data split of a larger data set to be processed by mapper program 110 for the assigned task (operation 202). In this embodiment, mapper program 110 receives the data split from job management computer system 102 and stores the data split locally on mapper system 108 a.

Mapper program 110 registers a memory region to be used for an RDMA mapper buffer (operation 204). In this embodiment, mapper program 110 registers a memory region expressed by a fixed memory address and a specified byte range following the fixed memory address (i.e., a locked memory region) that is off-JVM heap or otherwise separate from dynamic memory regions used by mapper system 108 a. The RDMA mapper buffer may, however, be located on the same one or more computer readable storage media as dynamic memory regions. In this embodiment, RDMA mapper buffers are sized such that no spilt data will be written to disk. Accordingly, mapper program 110 can determine a size for the RDMA mapper buffer to be created based on an anticipated amount of spilt data (e.g., size and number of spill files historically created for similar tasks).
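By way of non-limiting illustration, one possible way to derive a buffer size from historical spill data is sketched below. The function estimate_buffer_size, the 25% headroom factor, and the 4096-byte alignment are assumptions made for this sketch and are not specified by the embodiments described herein.

```python
import math


def estimate_buffer_size(history, headroom=1.25, alignment=4096):
    """Estimate a buffer size in bytes from historical spill file sizes.

    history: one list of spill file sizes (bytes) per earlier, similar task.
    headroom and alignment are assumed tuning values, not from the source.
    """
    if not history:
        raise ValueError("need at least one historical task")
    # Size for the worst task seen so far, i.e., the largest total spill.
    worst = max(sum(spill_sizes) for spill_sizes in history)
    padded = math.ceil(worst * headroom)
    # Round up to a whole number of alignment units.
    return math.ceil(padded / alignment) * alignment


# Hypothetical history: three earlier tasks and their spill file sizes.
history = [
    [100_000, 100_000, 40_000],
    [100_000, 80_000],
    [100_000, 100_000, 100_000],
]
size = estimate_buffer_size(history)
```

Sizing from the largest historical total, rather than the average, favors never exhausting the region at the cost of reserving more memory.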

The number of RDMA mapper buffers that are created can be configured based on various considerations, such as the number of reducer systems assigned for a processing job and/or the number of mapper systems that are assigned to a particular reducer system. For example, where there are multiple mapper systems hosted on a single machine that are assigned to a particular reducer system, mapper program 110 may create one or more RDMA mapper buffers to be shared by those multiple mapper systems. Where there are fewer mapper systems assigned to a particular reducer system, mapper program 110 may create RDMA mapper buffers dedicated for use by certain mapper systems.

Mapper program 110 executes one or more mapper tasks on the received data split (operation 206). In this embodiment, mapper program 110 executes one or more mapper tasks specified by mapper code (e.g., a mapper script). For example, mapper code may be executed to count occurrences of words or phrases within the data split.

Mapper program 110 outputs results of executing the one or more mapper tasks on the received data split (operation 208). In this embodiment, mapper program 110 outputs and writes results to primary memory buffer 114. After primary memory buffer 114 becomes full, mapper program 110 spills the output results to the RDMA mapper buffer (operation 210); mapper program 110 does not spill data to disk. In this embodiment, the output results are divided into partitions corresponding to assigned reducer systems to which the output results should be sent, as specified by job management computer system 102.
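By way of non-limiting illustration, dividing output results into partitions can be sketched as follows. Hash partitioning by key is an assumption of this sketch; the embodiments state only that partitions correspond to assigned reducer systems as specified by job management computer system 102. The names partition_for and partition_results are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer index for a key with a stable hash (assumed scheme)."""
    # zlib.crc32 is used because Python's built-in hash() of a str
    # varies between interpreter runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_results(results, num_reducers):
    """Group (key, value) output results by destination reducer index."""
    partitions = defaultdict(list)
    for key, value in results:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


# Hypothetical mapper output for a word-count task.
results = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
partitions = partition_results(results, num_reducers=3)
```

Because the hash is stable, every record for a given word lands in the same partition regardless of which mapper produced it.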

After outputting all results for the received data split, mapper program 110 notifies job management computer system 102 that processing of the data split has been completed (operation 212). In this embodiment, mapper program 110 notifies job management computer system 102 by transmitting a ready-to-read signal and an RDMA descriptor to job management computer system 102 via network 104. The RDMA descriptor contains the fixed memory address and byte range of the RDMA mapper buffer, along with a unique key to access the RDMA mapper buffer remotely.
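By way of non-limiting illustration, an RDMA descriptor carrying the three items named above can be sketched as a small record with a fixed wire encoding. The class name RdmaDescriptor, the field widths (64-bit address, 64-bit length, 32-bit key), and the network byte order are assumptions of this sketch rather than a format defined by the embodiments.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout: address (8 bytes), length (8 bytes), key (4 bytes),
# big-endian with no padding.
_WIRE = struct.Struct("!QQI")


@dataclass(frozen=True)
class RdmaDescriptor:
    address: int  # fixed memory address of the mapper buffer
    length: int   # byte range following that address
    key: int      # unique key granting remote access to the region

    def pack(self):
        """Encode the descriptor for transmission over the network."""
        return _WIRE.pack(self.address, self.length, self.key)

    @classmethod
    def unpack(cls, raw):
        """Decode a descriptor received from the network."""
        return cls(*_WIRE.unpack(raw))


# Hypothetical values for illustration only.
descriptor = RdmaDescriptor(address=0x7F00_0000_1000, length=376_832, key=0x1234)
wire_bytes = descriptor.pack()
decoded = RdmaDescriptor.unpack(wire_bytes)
```

A fixed-size encoding keeps the notification message small, since only the descriptor and not the output results themselves passes through the job management system.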

The operations of FIG. 2 can be repeated for each data split received by each assigned mapper system of MapReduce system 100. Accordingly, output results generated by mapper systems can be stored without having to spill data to disk and in a manner that facilitates RDMA transfer to assigned reducer systems.

FIG. 3 is a flowchart illustrating operations of a reducer system, in accordance with an embodiment of the present invention. For illustrative purposes, the following discussion is made with respect to reducer system 118 a.

Reducer program 120 receives initiation information from job management computer system 102 (operation 302). In this embodiment, such initiation information includes a job identifier (e.g., a unique number that identifies the job to which the assigned task belongs), an identifier of the one or more mapper systems to which it is assigned (e.g., mapper system 108 a), and RDMA descriptors to be used for RDMA transfer of output results stored in RDMA mapper buffers (e.g., RDMA mapper buffer 116) of mapper systems 108 a-n to which the reducer system is assigned.

Reducer program 120 registers a memory region to be used for an RDMA reducer buffer (operation 304). In this embodiment, reducer program 120 registers a fixed-address memory region, expressed as a fixed memory address and a specified byte range following the fixed memory address (i.e., a locked memory region), that is off-JVM heap or otherwise separate from dynamic memory regions used by reducer system 118 a. The RDMA reducer buffer may, however, be located on the same one or more computer readable storage media as dynamic memory regions. In this embodiment, RDMA reducer buffers are sized such that no data must be spilled to disk. Accordingly, reducer program 120 can determine a size for the RDMA reducer buffer to be created based on the sizes of RDMA mapper buffers of mapper systems (e.g., RDMA mapper buffer 116 of mapper system 108 a) to which reducer system 118 a is assigned.
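By way of non-limiting illustration, sizing a reducer buffer from the sizes of the assigned mapper buffers can be sketched as follows. Summing the mapper buffer lengths and giving each mapper a disjoint slice is an assumption of this sketch; the function name plan_reducer_buffer and the mapper identifiers are hypothetical.

```python
def plan_reducer_buffer(mapper_lengths):
    """Return (total_size, offsets) for a reducer buffer.

    mapper_lengths: mapping of mapper identifier to the byte length of
    that mapper's buffer, e.g., as carried in its RDMA descriptor.
    """
    offsets = {}
    cursor = 0
    for mapper_id, length in mapper_lengths.items():
        offsets[mapper_id] = cursor  # where this mapper's data will land
        cursor += length
    return cursor, offsets


# Hypothetical buffer lengths for three assigned mapper systems.
mapper_lengths = {"108a": 4096, "108b": 8192, "108c": 2048}
total_size, offsets = plan_reducer_buffer(mapper_lengths)
```

Assigning each mapper its own slice lets transfers from different mappers proceed without coordinating write positions.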

Reducer program 120 initiates one-sided RDMA transfer of output results from RDMA mapper buffer 116 to RDMA reducer buffer 124 via network 104 (operation 306). In this embodiment, RDMA transfer of the data is performed using known RDMA transfer technologies, such as InfiniBand and/or RoCE.
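By way of non-limiting illustration, the one-sided nature of the transfer can be modeled in a single process as follows. SimulatedFabric is a hypothetical toy class and is not an RDMA API; it only shows that the reader supplies a key, an offset, and a length, and that no code on the mapper side runs during the copy. An actual embodiment would rely on RDMA-capable network hardware.

```python
class SimulatedFabric:
    """Toy stand-in for an RDMA-capable network (not a real RDMA API)."""

    def __init__(self):
        self._regions = {}

    def register(self, key, buffer):
        """Expose a buffer for remote reads under the given key."""
        self._regions[key] = memoryview(buffer)

    def read(self, key, offset, length, dest, dest_offset=0):
        """Copy bytes from a registered region into the caller's buffer."""
        src = self._regions[key]
        if offset < 0 or length < 0 or offset + length > len(src):
            raise ValueError("read outside registered region")
        if dest_offset < 0 or dest_offset + length > len(dest):
            raise ValueError("destination buffer too small")
        dest[dest_offset:dest_offset + length] = src[offset:offset + length]
        return length


fabric = SimulatedFabric()

# Mapper side: register the buffer once; nothing further is required of it.
mapper_buffer = bytearray(b"fox\t2\nthe\t3\n" + bytes(4))
fabric.register(key=0x1234, buffer=mapper_buffer)

# Reducer side: pull the 12 bytes of results into its own fixed buffer.
reducer_buffer = bytearray(32)
copied = fabric.read(key=0x1234, offset=0, length=12, dest=reducer_buffer)
```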

Reducer program 120 sorts, merges, and stores results received from mapper system 108 a in primary memory buffer 122 (operation 308). In one embodiment, reducer program 120 copies output results from RDMA reducer buffer 124 to primary memory buffer 122, and sorts the output results in primary memory buffer 122. In another embodiment, reducer program 120 sorts output results in RDMA reducer buffer 124, and then stores the sorted output results in primary memory buffer 122.
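By way of non-limiting illustration, sorting and merging results received from several mappers can be sketched as follows. Sorting each mapper's results separately and then performing a k-way merge is one possible approach assumed for this sketch; the function name merge_runs and the sample data are hypothetical.

```python
import heapq


def merge_runs(runs):
    """Sort each mapper's results, then merge them into one sorted list."""
    sorted_runs = [sorted(run) for run in runs]
    # heapq.merge requires each input to be sorted already.
    return list(heapq.merge(*sorted_runs))


# Hypothetical (word, count) results received from three mappers.
runs = [
    [("the", 2), ("fox", 1)],
    [("dog", 1), ("the", 1)],
    [("fox", 1)],
]
merged = merge_runs(runs)
```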

Reducer program 120 executes one or more reducer tasks on the sorted output results stored in primary memory buffer 122 (operation 310). In this embodiment, reducer program 120 executes one or more reducer tasks specified by reducer code (e.g., a reducer script). For example, reducer code may be executed to total all counts of occurrences of words or phrases within the data split.
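By way of non-limiting illustration, a reducer task that totals counts over sorted output results can be sketched as follows. The function name reduce_sorted and the sample data are hypothetical; the sketch assumes the input is already sorted by word, as described above.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(sorted_pairs):
    """Total the counts for each word in a list sorted by word."""
    # groupby only groups adjacent items, so the input must be sorted.
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


# Hypothetical sorted (word, count) results held in the primary buffer.
sorted_pairs = [("dog", 1), ("fox", 1), ("fox", 1), ("the", 1), ("the", 2)]
totals = reduce_sorted(sorted_pairs)
```

Because equal words are adjacent after sorting, the reducer can total each word in a single pass without holding all words in memory at once.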

The operations of FIG. 3 can be repeated for each partition and each assigned reducer system of MapReduce system 100. Accordingly, embodiments of the present invention can be used to perform MapReduce jobs using direct memory-to-memory transfer of spilt data from mappers to reducers, thereby improving the speed and efficiency with which the MapReduce jobs are performed.

FIG. 4 is a block diagram of internal and external components of a computer system 400, which is representative of the computer systems of FIG. 1, in accordance with an embodiment of the present invention. It should be appreciated that FIG. 4 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. In general, the components illustrated in FIG. 4 are representative of any electronic device capable of executing machine-readable program instructions. Examples of computer systems, environments, and/or configurations that may be represented by the components illustrated in FIG. 4 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, laptop computer systems, tablet computer systems, cellular telephones (e.g., smart phones), multiprocessor systems, microprocessor-based systems, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices.

Computer system 400 includes communications fabric 402, which provides for communications between one or more processors 404, memory 406, persistent storage 408, communications unit 412, and one or more input/output (I/O) interfaces 414. Communications fabric 402 can be implemented with any architecture designed for passing data and/or control information between processors (such as microprocessors, communications and network processors, etc.), system memory, peripheral devices, and any other hardware components within a system. For example, communications fabric 402 can be implemented with one or more buses.

Memory 406 and persistent storage 408 are computer-readable storage media. In this embodiment, memory 406 includes random access memory (RAM) 416 and cache memory 418. In general, memory 406 can include any suitable volatile or non-volatile computer-readable storage media. Software is stored in persistent storage 408 for execution and/or access by one or more of the respective processors 404 via one or more memories of memory 406.

Persistent storage 408 may include, for example, a plurality of magnetic hard disk drives. Alternatively, or in addition to magnetic hard disk drives, persistent storage 408 can include one or more solid state hard drives, semiconductor storage devices, read-only memories (ROM), erasable programmable read-only memories (EPROM), flash memories, or any other computer-readable storage media that is capable of storing program instructions or digital information.

The media used by persistent storage 408 can also be removable. For example, a removable hard drive can be used for persistent storage 408. Other examples include optical and magnetic disks, thumb drives, and smart cards that are inserted into a drive for transfer onto another computer-readable storage medium that is also part of persistent storage 408.

Communications unit 412 provides for communications with other computer systems or devices via a network (e.g., network 104). In this exemplary embodiment, communications unit 412 includes network adapters or interfaces such as TCP/IP adapter cards, wireless Wi-Fi interface cards, or 3G or 4G wireless interface cards or other wired or wireless communication links. The network can comprise, for example, copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. Software and data used to practice embodiments of the present invention can be downloaded to job management computer system 102, task management computer system 106, and task management computer system 107 through communications unit 412 (e.g., via the Internet, a local area network or other wide area network). From communications unit 412, the software and data can be loaded onto persistent storage 408.

One or more I/O interfaces 414 allow for input and output of data with other devices that may be connected to computer system 400. For example, I/O interface 414 can provide a connection to one or more external devices 420 such as a keyboard, computer mouse, touch screen, virtual keyboard, touch pad, pointing device, or other human interface devices. External devices 420 can also include portable computer-readable storage media such as, for example, thumb drives, portable optical or magnetic disks, and memory cards. I/O interface 414 also connects to display 422.

Display 422 provides a mechanism to display data to a user and can be, for example, a computer monitor. Display 422 can also be an incorporated display and may function as a touch screen, such as a built-in display of a tablet computer.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operations to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
 1. A method for transferring data in a MapReduce framework comprising a mapper system and a reducer system, the method comprising: receiving, by one or more computer processors, a data split assigned to a mapper system; registering, by one or more computer processors, a first fixed-address memory region for the mapper system to be used for a remote direct memory access (RDMA) reducer buffer, wherein the first fixed-address memory region is a locked memory region, expressed as a fixed memory address followed by a specified byte range, that is separated from dynamic memory regions on a virtual machine used by the mapper system, and wherein registering includes determining a size for the RDMA reducer buffer to be created based on a size of an RDMA mapper buffer of the mapper system to which the reducer system is assigned, wherein the RDMA reducer buffer is sized such that no data is spilled to disk; executing, by one or more computer processors, one or more mapper tasks on the data split to generate output results; spilling, by one or more computer processors, generated output results to the first fixed-address memory region, such that no data is spilled to disk; transferring, by one or more computer processors, generated output results from the first fixed-address memory region to the reducer system using RDMA, wherein transferring includes performing an RDMA transfer of generated output results from the first fixed-address memory region to a second fixed-address memory region; sorting, by one or more computer processors, the output results in the second fixed-address memory region; and transferring, by one or more computer processors, the sorted output results from the second fixed-address memory region to a primary memory buffer, such that no data is spilled to disk.
 2. The method of claim 1, further comprising: registering, by one or more computer processors, the second fixed-address memory region for the reducer system.
 3. The method of claim 1, wherein the RDMA transfer is performed using both InfiniBand and RDMA over Converged Ethernet (RoCE).
 4. The method of claim 1, further comprising: transferring, by one or more computer processors, the generated output results from the second fixed-address memory region to a dynamic memory region; and sorting, by one or more computer processors, the generated output results in the dynamic memory region.
 5. A computer program product for transferring data in a MapReduce framework comprising a mapper system and a reducer system, the computer program product comprising: one or more computer readable storage memory and program instructions stored on the one or more computer readable storage memory, the program instructions comprising: program instructions to receive a data split assigned to a mapper system; program instructions to register a first fixed-address memory region for the mapper system to be used for a remote direct memory access (RDMA) reducer buffer, wherein the first fixed-address memory region is a locked memory region, expressed as a fixed memory address followed by a specified byte range, that is separated from dynamic memory regions on a virtual machine used by the mapper system, and wherein registering includes determining a size for the RDMA reducer buffer to be created based on a size of an RDMA mapper buffer of the mapper system to which the reducer system is assigned, wherein the RDMA reducer buffer is sized such that no data is spilled to disk; program instructions to execute one or more mapper tasks on the data split to generate output results; program instructions to spill generated output results to the first fixed-address memory region, such that no data is spilled to disk; program instructions to transfer generated output results from the first fixed-address memory region to the reducer system using RDMA, wherein transferring includes performing an RDMA transfer of generated output results from the first fixed-address memory region to a second fixed-address memory region; sorting, by one or more computer processors, the output results in the second fixed-address memory region; and transferring, by one or more computer processors, the sorted output results from the second fixed-address memory region to a primary memory buffer.
 6. The computer program product of claim 5, wherein the program instructions stored on the one or more computer readable storage memory further comprise: program instructions to register a second fixed-address memory region for the reducer system.
 7. The computer program product of claim 5, wherein the RDMA transfer is performed using both InfiniBand and RDMA over Converged Ethernet (RoCE).
 8. The computer program product of claim 5, wherein the program instructions stored on the one or more computer readable storage memory further comprise: program instructions to transfer the generated output results from the second fixed-address memory region to a dynamic memory region; and program instructions to sort the generated output results in the dynamic memory region.
 9. A computer system for transferring data in a MapReduce framework comprising a mapper system and a reducer system, the computer system comprising: one or more computer processors; one or more computer readable storage memory; program instructions stored on the one or more computer readable storage memory for execution by at least one of the one or more processors, the program instructions comprising: program instructions to receive a data split assigned to a mapper system; program instructions to register a first fixed-address memory region for the mapper system to be used for a remote direct memory access (RDMA) reducer buffer, wherein the first fixed-address memory region is a locked memory region, expressed as a fixed memory address followed by a specified byte range, that is separated from dynamic memory regions on a virtual machine used by the mapper system, and wherein registering includes determining a size for the RDMA reducer buffer to be created based on a size of an RDMA mapper buffer of the mapper system to which the reducer system is assigned, wherein the RDMA reducer buffer is sized such that no data is spilled to disk; program instructions to execute one or more mapper tasks on the data split to generate output results; program instructions to spill generated output results to the first fixed-address memory region, such that no data is spilled to disk; program instructions to transfer generated output results from the first fixed-address memory region to the reducer system using RDMA, wherein transferring includes performing an RDMA transfer of generated output results from the first fixed-address memory region to a second fixed-address memory region; sorting, by one or more computer processors, the output results in the second fixed-address memory region; and transferring, by one or more computer processors, the sorted output results from the second fixed-address memory region to a primary memory buffer.
 10. The computer system of claim 9, wherein the program instructions stored on the one or more computer readable storage memory further comprise: program instructions to register a second fixed-address memory region for the reducer system.
 11. The computer system of claim 9, wherein the RDMA transfer is performed using both InfiniBand and RDMA over Converged Ethernet (RoCE).
 12. The computer system of claim 9, wherein the program instructions stored on the one or more computer readable storage memory further comprise: program instructions to transfer the generated output results from the second fixed-address memory region to a dynamic memory region; and program instructions to sort the generated output results in the dynamic memory region.
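
By way of non-limiting illustration only, the data flow recited in claim 1 can be sketched as a small in-process simulation. The sketch below is not the claimed implementation and forms no part of the claims: the names FixedRegion, size_reducer_region, simulated_rdma_transfer, sort_in_place, and run_job, the addresses, the 256-byte size, and the word-count map task are all hypothetical, and the "RDMA transfer" is an ordinary memory copy standing in for a one-sided RDMA operation over InfiniBand or RoCE. It shows only the ordering of steps: spill to a fixed-size region, transfer to a second region sized from the first, sort there, and hand off to a primary buffer, with no fallback to disk.

```python
from dataclasses import dataclass, field


@dataclass
class FixedRegion:
    """Hypothetical stand-in for a registered, locked memory region:
    a fixed base address followed by a byte range."""
    address: int
    size: int
    data: bytearray = field(init=False)
    used: int = field(default=0, init=False)

    def __post_init__(self):
        self.data = bytearray(self.size)

    def spill(self, record: bytes) -> None:
        """Append one newline-terminated record; never fall back to disk."""
        end = self.used + len(record) + 1
        if end > self.size:
            raise MemoryError("region full; this sketch does not spill to disk")
        self.data[self.used:end] = record + b"\n"
        self.used = end


def size_reducer_region(mapper_regions):
    """Assumed sizing rule: the reducer-side region holds every assigned
    mapper-side region in full, so nothing needs to go to disk."""
    return sum(region.size for region in mapper_regions)


def simulated_rdma_transfer(src: FixedRegion, dst: FixedRegion) -> None:
    """Plain memory copy standing in for a one-sided RDMA transfer."""
    end = dst.used + src.used
    if end > dst.size:
        raise MemoryError("destination region too small")
    dst.data[dst.used:end] = src.data[:src.used]
    dst.used = end


def sort_in_place(region: FixedRegion) -> None:
    """Sort the records while keeping them inside the same region."""
    records = sorted(bytes(region.data[:region.used]).splitlines())
    region.data[:region.used] = b"".join(r + b"\n" for r in records)


def run_job(split: str):
    """Map (emit word<TAB>1), spill, transfer, sort, hand off."""
    mapper_region = FixedRegion(address=0x1000, size=256)
    for word in split.split():
        mapper_region.spill(word.encode() + b"\t1")

    reducer_region = FixedRegion(
        address=0x2000, size=size_reducer_region([mapper_region])
    )
    simulated_rdma_transfer(mapper_region, reducer_region)
    sort_in_place(reducer_region)

    # Hand-off to the reducer's primary memory buffer (a list of records here).
    return bytes(reducer_region.data[:reducer_region.used]).splitlines()


if __name__ == "__main__":
    for record in run_job("b a b"):
        print(record.decode())
```

In this sketch a full region raises an error rather than writing a spill file, which mirrors the "no data is spilled to disk" limitation only in the simplest possible way; an actual embodiment would rely on the buffer sizing recited in the claims.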